Abstract. This paper investigates domain generalization: How to take knowl-
edge acquired from an arbitrary number of related domains and apply it to
previously unseen domains? We propose Domain-Invariant Component Anal-
ysis (DICA), a kernel-based optimization algorithm that learns an invariant
transformation by minimizing the dissimilarity across domains, whilst pre- 
serving the functional relationship between input and output variables. A 
learning-theoretic analysis shows that reducing dissimilarity improves the ex- 
pected generalization ability of classifiers on new domains, motivating the pro- 
posed algorithm. Experimental results on synthetic and real-world datasets 
demonstrate that DICA successfully learns invariant features and improves 
classifier performance in practice. 



1. Introduction 

Domain generalization considers how to take knowledge acquired from an arbi-
trary number of related domains, and apply it to previously unseen domains. To
illustrate the problem, consider an example taken from Blanchard et al. (2011),
which studied automatic gating of flow cytometry data. For each of N patients, a
set of $n_i$ cells are obtained from peripheral blood samples using a flow cytometer.
The cells are then labeled by an expert into different subpopulations, e.g., as a
lymphocyte or not. Correctly identifying cell subpopulations is vital for diagnosing 
the health of patients. However, manual gating is very time consuming. To au- 
tomate gating, we need to construct a classifier that generalizes well to previously 
unseen patients, where the distribution of cell types may differ dramatically from 
the training data. 

Unfortunately, we cannot apply standard machine learning techniques directly 
because the data violates the basic assumption that training data and test data 
come from the same distribution. Moreover, the training set consists of heteroge-
neous samples from several distributions, i.e., gated cells from several patients. In
this case, the data exhibits covariate (or dataset) shift (Widmer and Kubat, 1996;
Quionero-Candela et al., 2009; Bickel et al., 2009b): although the marginal distri-
butions $P_X$ on cell attributes vary due to biological or technical variations, the
functional relationship P(Y|X) across different domains is largely stable (cell type 
is a stable function of a cell's chemical attributes). 
Figure 1. A simplified schematic diagram of the domain gener-
alization framework. A major difference between our framework
and most previous work in domain adaptation is that we do not 
observe the test domains during training time. See text for detailed 
description on how the data are generated. 



A considerable effort has been made in domain adaptation and transfer learning
to remedy this problem, see Pan and Yang (2010a), Ben-David et al. (2010) and
references therein. Given a test domain, e.g., a cell population from a new patient, 
the idea of domain adaptation is to adapt a classifier trained on the training domain, 
e.g., a cell population from another patient, such that the generalization error on the 
test domain is minimized. The main drawback of this approach is that one has to 
repeat this process for every new patient, which can be time-consuming - especially 
in medical diagnosis where time is a valuable asset. In this work, across-domain 
information, which may be more informative than the domain-specific information, 
is extracted from the training data and used to generalize the classifier to new 
patients without retraining. 

1.1. Overview. The goal of (supervised) domain generalization is to estimate a
functional relationship that handles changes in the marginal P(X) or conditional
P(Y|X) well, see Figure 1. We assume that the conditional probability P(Y|X) is
stable or varies smoothly with the marginal P(X). Even if the conditional is stable,
learning algorithms may still suffer from model misspecification due to variation in
the marginal P(X). That is, if the learning algorithm cannot find a solution that
perfectly captures the functional relationship between X and Y then its approxi-
mate solution will be sensitive to changes in P(X).

In this paper, we introduce Domain Invariant Component Analysis (DICA), a 
kernel-based algorithm that finds a transformation of the data that (i) minimizes 
the difference between marginal distributions $P_X$ of domains as much as possible
while (ii) preserving the functional relationship P(Y|X).

The novelty of this work is twofold. First, DICA extracts invariants: features
that transfer across domains. It not only minimizes the divergence between mar-
ginal distributions P(X), but also preserves the functional relationship encoded in
the posterior P(Y|X). The resulting learning algorithm is very simple. Second,
while prior work in domain adaptation focused on using data from many different 
domains to specifically improve the performance on the target task, which is ob- 
served during the training time (the classifier is adapted to the specific target task) , 
we assume access to abundant training data and are interested in the generaliza- 
tion ability of the invariant subspace to previously unseen domains (the classifier 
generalizes to new domains without retraining). 
Moreover, we show that DICA generalizes or is closely related to many well-
known dimension reduction algorithms including kernel principal component anal-
ysis (KPCA) (Schölkopf et al., 1998; Fukumizu et al., 2004a), transfer component
analysis (TCA) (Pan et al., 2011), and covariance operator inverse regression (COIR)
(Kim and Pavlovic, 2011), see §2.5. The performance of DICA is analyzed theoret-
ically (§2.6) and demonstrated empirically.



1.2. Related work. Domain generalization is a form of transfer learning, which
applies expertise acquired in source domains to improve learning of target domains
(cf. Pan and Yang (2010a) and references therein). Most previous work assumes
the availability of the target domain to which the knowledge will be transferred. In
contrast, domain generalization focuses on the generalization ability on previously
unseen domains. That is, the test data comes from domains that are not available
during training.

Recently, Blanchard et al. (2011) proposed an augmented SVM that incorporates
empirical marginal distributions into the kernel. A detailed error analysis showed
universal consistency of the approach. We apply methods from Blanchard et al.
(2011) to derive theoretical guarantees on the finite sample performance of DICA.

Learning a shared subspace is a common approach in settings where there is
distribution mismatch. For example, a typical approach in multitask learning is
to uncover a joint (latent) feature/subspace that benefits tasks individually (Ar-
gyriou et al., 2007; Gu and Zhou, 2009; Passos et al., 2012). A similar idea has been
adopted in domain adaptation, where the learned subspace reduces mismatch be-
tween source and target domains (Gretton et al., 2009; Pan et al., 2011). Although
these approaches have proven successful in various applications, no previous work 
has fully investigated the generalization ability of a subspace to unseen domains. 

2. Domain-Invariant Component Analysis 

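Let $\mathcal{X}$ denote a nonempty input space and $\mathcal{Y}$ an arbitrary output space. We
define a domain to be a joint distribution $\mathbb{P}_{XY}$ on $\mathcal{X} \times \mathcal{Y}$, and let $\mathfrak{P}_{\mathcal{X}\times\mathcal{Y}}$ denote
the set of all domains. Let $\mathfrak{P}_{\mathcal{X}}$ and $\mathfrak{P}_{\mathcal{Y}|\mathcal{X}}$ denote the set of probability distributions
$\mathbb{P}_X$ on $\mathcal{X}$ and $\mathbb{P}_{Y|X}$ on $\mathcal{Y}$ given $\mathcal{X}$ respectively.

We assume domains are sampled from probability distribution $\mathscr{P}$ on $\mathfrak{P}_{\mathcal{X}\times\mathcal{Y}}$
which has a bounded second moment, i.e., the variance is well-defined. Domains
are not observed directly. Instead, we observe N samples $\mathcal{S} = \{S^i\}_{i=1}^{N}$, where
$S^i = \{(x_k^{(i)}, y_k^{(i)})\}_{k=1}^{n_i}$ is sampled from $\mathbb{P}^i_{XY}$ and each $\mathbb{P}^1_{XY}, \ldots, \mathbb{P}^N_{XY}$ is sampled
from $\mathscr{P}$. Since in general $\mathbb{P}^i_{XY} \neq \mathbb{P}^j_{XY}$, the samples in $\mathcal{S}$ are not i.i.d. Let $\hat{\mathbb{P}}^i$
denote empirical distribution associated with each sample $S^i$. For brevity, we use
$\mathbb{P}$ and $\mathbb{P}_X$ interchangeably to denote the marginal distribution.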

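Let $\mathcal{H}$ and $\mathcal{F}$ denote reproducing kernel Hilbert spaces (RKHSes) on $\mathcal{X}$ and $\mathcal{Y}$
with kernels $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, respectively. Associated with
$\mathcal{H}$ and $\mathcal{F}$ are mappings $x \to \phi(x) \in \mathcal{H}$ and $y \to \varphi(y) \in \mathcal{F}$ induced by the kernels
$k(\cdot,\cdot)$ and $l(\cdot,\cdot)$. Without loss of generality, we assume the feature maps of X and
Y have zero means, i.e., $\sum_{k=1}^{n} \phi(x_k) = 0 = \sum_{k=1}^{n} \varphi(y_k)$. Let $\Sigma_{xx}$, $\Sigma_{yy}$, $\Sigma_{xy}$, and
$\Sigma_{yx}$ be the covariance operators in and between the RKHSes of X and Y.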

2.1. Objective. Using the samples $\mathcal{S}$, our goal is to produce an estimate $f : \mathfrak{P}_{\mathcal{X}} \times
\mathcal{X} \to \mathbb{R}$ that generalizes well to test samples $S^t = \{x_k^{(t)}\}_{k=1}^{n_t}$ drawn according to
some unknown distribution $\mathbb{P}^t \in \mathfrak{P}_{\mathcal{X}}$ (Blanchard et al., 2011). Since the performance
of $f$ depends in part on how dissimilar the test distribution $\mathbb{P}^t$ is from those in
the training samples, we propose to preprocess the data to actively reduce the
dissimilarity between domains. Intuitively, we want to find transformation $\mathcal{B}$ in $\mathcal{H}$
that (i) minimizes the distance between empirical distributions of the transformed
samples $\mathcal{B}(S^i)$ and (ii) preserves the functional relationship between X and Y, i.e.,
$Y \perp X \mid \mathcal{B}(X)$. We formulate an optimization problem capturing these constraints
below.



2.2. Distributional Variance. First, we define the distributional variance, which
measures the dissimilarity across domains. It is convenient to represent distribu-
tions as elements in an RKHS (Berlinet and Agnan, 2004; Smola et al., 2007; Sripe-
rumbudur et al., 2010) using the mean map

(1)   $\mu : \mathfrak{P}_{\mathcal{X}} \to \mathcal{H} : \mathbb{P} \mapsto \int_{\mathcal{X}} k(x, \cdot)\, d\mathbb{P}(x) =: \mu_{\mathbb{P}}$ .

We assume that $k(x, x)$ is bounded for any $x \in \mathcal{X}$ such that $\mathbb{E}_{x \sim \mathbb{P}}[k(x, \cdot)] < \infty$. If $k$
is characteristic then (1) is injective, i.e., all the information about the distribution
is preserved (Sriperumbudur et al., 2010). It also holds that $\mathbb{E}_{\mathbb{P}}[f] = \langle \mu_{\mathbb{P}}, f \rangle_{\mathcal{H}}$ for
all $f \in \mathcal{H}$ and any $\mathbb{P}$.

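We decompose $\mathscr{P}$ into $\mathscr{P}_X$, which generates the marginal distribution $\mathbb{P}_X$, and
$\mathscr{P}_{Y|X}$, which generates posteriors $\mathbb{P}_{Y|X}$. The data generating process begins by
generating the marginal $\mathbb{P}_X$ according to $\mathscr{P}_X$. Conditioned on $\mathbb{P}_X$, it then generates
conditional $\mathbb{P}_{Y|X}$ according to $\mathscr{P}_{Y|X}$. The data point $(x, y)$ is generated according
to $\mathbb{P}_X$ and $\mathbb{P}_{Y|X}$, respectively. Given set of distributions $\mathcal{P} = \{\mathbb{P}^1, \mathbb{P}^2, \ldots, \mathbb{P}^N\}$
drawn according to $\mathscr{P}_X$, define $N \times N$ Gram matrix $G$ with entries

(2)   $G_{ij} := \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^j} \rangle_{\mathcal{H}} = \iint k(x, z)\, d\mathbb{P}^i(x)\, d\mathbb{P}^j(z)$

for $i, j = 1, \ldots, N$. Note that $G_{ij}$ is the inner product between kernel mean em-
beddings of $\mathbb{P}^i$ and $\mathbb{P}^j$ in $\mathcal{H}$. Based on (2), we define the distributional variance,
which estimates the variance of the distribution $\mathscr{P}_X$.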

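Definition 1. Introduce probability distribution $\mathcal{P}$ on $\mathcal{H}$ with $\mathcal{P}(\mu_{\mathbb{P}^i}) = \frac{1}{N}$ and
center $G$ to obtain the covariance operator of $\mathcal{P}$, denoted as $\Sigma := G - \mathbf{1}_N G -
G \mathbf{1}_N + \mathbf{1}_N G \mathbf{1}_N$, where $\mathbf{1}_N$ is the $N \times N$ matrix with all entries equal to $1/N$.
The distributional variance is

(3)   $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) := \frac{1}{N}\mathrm{tr}(\Sigma) = \frac{1}{N}\mathrm{tr}(G) - \frac{1}{N^2}\sum_{i,j=1}^{N} G_{ij}$ .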



The following theorem shows that the distributional variance is suitable as a 
measure of divergence between domains. 



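Theorem 1. Let $\bar{\mathbb{P}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{P}^i$. If $k$ is a characteristic kernel, then $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) =
\frac{1}{N}\sum_{i=1}^{N} \|\mu_{\mathbb{P}^i} - \mu_{\bar{\mathbb{P}}}\|^2_{\mathcal{H}} = 0$ if and only if $\mathbb{P}^1 = \mathbb{P}^2 = \cdots = \mathbb{P}^N$.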
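
A short derivation, using only definition (3) and linearity of the mean map, may help to see why the distributional variance equals the average squared RKHS distance to the mean embedding. The shorthand $\mu_i := \mu_{\mathbb{P}^i}$ and $\bar\mu := \mu_{\bar{\mathbb{P}}}$ is introduced here for readability and is not notation from the paper:

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^{N}\|\mu_i-\bar\mu\|_{\mathcal H}^2
 &= \frac{1}{N}\sum_{i=1}^{N}\langle\mu_i,\mu_i\rangle_{\mathcal H}
    - 2\Big\langle \bar\mu,\ \frac{1}{N}\sum_{i=1}^{N}\mu_i\Big\rangle_{\mathcal H}
    + \|\bar\mu\|_{\mathcal H}^2 \\
 &= \frac{1}{N}\sum_{i=1}^{N} G_{ii} - \|\bar\mu\|_{\mathcal H}^2
    \qquad\text{since } \bar\mu = \frac{1}{N}\sum_{i=1}^{N}\mu_i \\
 &= \frac{1}{N}\operatorname{tr}(G) - \frac{1}{N^2}\sum_{i,j=1}^{N} G_{ij}
  \;=\; \mathbb{V}_{\mathcal H}(\mathcal P).
\end{align*}
```

The sum vanishes exactly when every $\mu_i$ equals $\bar\mu$; for a characteristic kernel the mean map is injective, so this happens only when all $\mathbb{P}^i$ coincide.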



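To estimate $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$ from N sample sets $\mathcal{S} = \{S^i\}_{i=1}^{N}$ drawn from $\mathbb{P}^1, \ldots, \mathbb{P}^N$, we
define block kernel and coefficient matrices

$K = \begin{pmatrix} K_{1,1} & \cdots & K_{1,N} \\ \vdots & \ddots & \vdots \\ K_{N,1} & \cdots & K_{N,N} \end{pmatrix} \in \mathbb{R}^{n \times n}, \qquad Q = \begin{pmatrix} Q_{1,1} & \cdots & Q_{1,N} \\ \vdots & \ddots & \vdots \\ Q_{N,1} & \cdots & Q_{N,N} \end{pmatrix} \in \mathbb{R}^{n \times n},$

where $n = \sum_{i=1}^{N} n_i$ and $[K_{i,j}]_{k,l} = k(x_k^{(i)}, x_l^{(j)})$ is the Gram matrix evaluated be-
tween the sample $S^i$ and $S^j$. Following (3), elements of the coefficient matrix
$Q_{i,j} \in \mathbb{R}^{n_i \times n_j}$ equal $(N - 1)/(N^2 n_i^2)$ if $i = j$, and $-1/(N^2 n_i n_j)$ otherwise. Hence,
the empirical distributional variance is

(4)   $\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{S}) = \frac{1}{N}\mathrm{tr}(\hat{\Sigma}) = \mathrm{tr}(KQ)$ .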
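Theorem 2. The empirical estimator $\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{S}) = \frac{1}{N}\mathrm{tr}(\hat{\Sigma}) = \mathrm{tr}(KQ)$ obtained from
Gram matrix

$\hat{G}_{ij} := \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} k(x_k^{(i)}, x_l^{(j)})$

is a consistent estimator of $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$.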
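
The estimator (4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `rbf_kernel`, `coefficient_matrix`, and `distributional_variance` are hypothetical, and a Gaussian RBF kernel with bandwidth `sigma` is assumed for $k$.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and the rows of Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def coefficient_matrix(sizes):
    """Block matrix Q: (N-1)/(N^2 n_i^2) on diagonal blocks,
    -1/(N^2 n_i n_j) on off-diagonal blocks."""
    N = len(sizes)
    w = np.concatenate([np.full(n_i, 1.0 / n_i) for n_i in sizes])
    Q = -np.outer(w, w) / N**2            # off-diagonal formula everywhere
    start = 0
    for n_i in sizes:                     # overwrite the diagonal blocks
        Q[start:start + n_i, start:start + n_i] = (N - 1) / (N**2 * n_i**2)
        start += n_i
    return Q

def distributional_variance(samples, sigma=1.0):
    """Empirical distributional variance tr(KQ) for a list of (n_i, d) arrays."""
    X = np.vstack(samples)
    K = rbf_kernel(X, X, sigma)
    Q = coefficient_matrix([len(s) for s in samples])
    return np.trace(K @ Q)
```

Passing the same sample several times should give a value of zero up to rounding, which is a quick sanity check that `Q` was assembled correctly.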

2.3. Formulation of DICA. DICA finds an orthogonal transform $\mathcal{B}$ onto a low-
dimensional subspace ($m \ll n$) that minimizes the distributional variance $\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{S})$
between samples from $\mathcal{S}$, i.e. the dissimilarity across domains. Simultaneously, we
require that $\mathcal{B}$ preserves the functional relationship between X and Y, i.e. $Y \perp
X \mid \mathcal{B}(X)$.

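2.3.1. Minimizing distributional variance. In order to simplify notation, we "flat-
ten" $\{\{(x_k^{(i)}, y_k^{(i)})\}_{k=1}^{n_i}\}_{i=1}^{N}$ to $\{(x_k, y_k)\}_{k=1}^{n}$ where $n = \sum_{i=1}^{N} n_i$. Let $b_k = \sum_{i=1}^{n} \beta_k^i \phi(x_i)
= \Phi_x \beta_k$ be the $k$-th basis function of $\mathcal{B}$ where $\Phi_x = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$ and $\beta_k$
are n-dimensional coefficient vectors. Let $B = [\beta_1, \beta_2, \ldots, \beta_m]$ and $\tilde{\Phi}_x$ denote the
projection of $\Phi_x$ onto $b_k$, i.e., $\tilde{\Phi}_x = b_k^\top \Phi_x = \beta_k^\top \Phi_x^\top \Phi_x = \beta_k^\top K$. The kernel on the
$\mathcal{B}$-projection of X is

(5)   $\tilde{K} := \tilde{\Phi}_x^\top \tilde{\Phi}_x = K B B^\top K$ .

After applying transformation $\mathcal{B}$, the empirical distributional variance between sam-
ple distributions is

(6)   $\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{B}\mathcal{S}) = \mathrm{tr}(\tilde{K} Q) = \mathrm{tr}(B^\top K Q K B)$ .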

2.3.2. Preserving the functional relationship. The central subspace C is the minimal
subspace that captures the functional relationship between X and Y, i.e. $Y \perp
X \mid C^\top X$. Note that in this work we generalize a linear transformation $C^\top X$ to
a nonlinear one $\mathcal{B}(X)$. To find the central subspace we use the inverse regression
framework.

Theorem 3. If there exists a central subspace $C = [c_1, \ldots, c_m]$ satisfying $Y \perp
X \mid C^\top X$, and for any $a \in \mathbb{R}^d$, $\mathbb{E}[a^\top X \mid C^\top X]$ is linear in $\{c_i^\top X\}_{i=1}^{m}$, then $\mathbb{E}[X|Y] \subset
\mathrm{span}\{\Sigma_{xx} c_i\}_{i=1}^{m}$.

It follows that the bases C of the central subspace coincide with the m largest
eigenvectors of $\mathbb{V}(\mathbb{E}[X|Y])$ premultiplied by $\Sigma_{xx}^{-1}$. Thus, the basis $c$ is the solution
to the eigenvalue problem $\mathbb{V}(\mathbb{E}[X|Y])\Sigma_{xx} c = \gamma \Sigma_{xx} c$. Alternatively, for each $c_k$ one
may solve

$\max_{c_k \in \mathbb{R}^d} \dfrac{c_k^\top \Sigma_{xx}^{-1} \mathbb{V}(\mathbb{E}[X|Y]) \Sigma_{xx} c_k}{c_k^\top c_k}$

under the condition that $c_k$ is chosen to not be in the span of the previously chosen
$c_k$. In our case, $x$ is mapped to $\phi(x) \in \mathcal{H}$ induced by the kernel $k$ and $\mathcal{B}$ has
nonlinear basis functions $b_k \in \mathcal{H}$, $k = 1, \ldots, m$. This nonlinear extension implies
that $\mathbb{E}[X|Y]$ lies on a function space spanned by $\{\Sigma_{xx} c_k\}_{k=1}^{m}$, which coincide with
the eigenfunctions of the operator $\mathbb{V}(\mathbb{E}[X|Y])$ (Wu, 2008; Kim and Pavlovic, 2011).
Since we always work in $\mathcal{H}$, we drop $\phi$ from the notation below.

To avoid slicing the output space explicitly (Li, 1991; Wu, 2008), we exploit its
kernel structure when estimating the covariance of the inverse regressor. The fol-
lowing result from Kim and Pavlovic (2011) states that, under a mild assumption,
$\mathbb{V}(\mathbb{E}[X|Y])$ can be expressed in terms of covariance operators:

Theorem 4. If for all $f \in \mathcal{H}$, there exists $g \in \mathcal{F}$ such that $\mathbb{E}[f(X)|y] = g(y)$ for
almost every $y$, then

(7)   $\mathbb{V}(\mathbb{E}[X|Y]) = \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$ .
Let $\Phi_y = [\varphi(y_1), \ldots, \varphi(y_n)]$ and $L = \Phi_y^\top \Phi_y$. The covariance of inverse regressor
(7) is estimated from the samples $\mathcal{S}$ as $\hat{\mathbb{V}}(\mathbb{E}[X|Y]) = \hat{\Sigma}_{xy} \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{yx} = \frac{1}{n}\Phi_x L (L +
n\varepsilon I_n)^{-1} \Phi_x^\top$ where $\hat{\Sigma}_{xy} = \frac{1}{n}\Phi_x \Phi_y^\top$ and $\hat{\Sigma}_{yy} = \frac{1}{n}\Phi_y \Phi_y^\top$. Assuming inverses $\hat{\Sigma}_{yy}^{-1}$ and
$\hat{\Sigma}_{xx}^{-1}$ exist, a straightforward computation (see Supplementary) shows

      $b_k^\top \hat{\Sigma}_{xx}^{-1} \hat{\mathbb{V}}(\mathbb{E}[X|Y]) \hat{\Sigma}_{xx} b_k = \frac{1}{n} \beta_k^\top L (L + n\varepsilon I_n)^{-1} K^2 \beta_k$

(8)   $b_k^\top b_k = \beta_k^\top K \beta_k$ ,

where $\varepsilon$ smoothes the affinity structure of the output space Y, thus acting as a kernel
regularizer. Since we are interested in the projection of $\phi(x)$ onto the basis functions
$b_k$, we formulate the optimization in terms of $\beta_k$. For a new test sample $x_t$, the
projection onto basis function $b_k$ is $k_t \beta_k$, where $k_t = [k(x_1, x_t), \ldots, k(x_n, x_t)]$.

2.3.3. The optimization problem. Combining (6) and (8), DICA finds $B = [\beta_1, \beta_2, \ldots, \beta_m]$
that solves

(9)   $\max_{B \in \mathbb{R}^{n \times m}} \dfrac{\frac{1}{n}\mathrm{tr}\left(B^\top L (L + n\varepsilon I_n)^{-1} K^2 B\right)}{\mathrm{tr}\left(B^\top K Q K B + B^\top K B\right)}$ .

The numerator requires that B aligns with the bases of the central subspace.
The denominator forces both dissimilarity across domains and the complexity of B
to be small, thereby tightening generalization bounds, see §2.6. Rewriting (9) as a
constrained optimization (see Supplementary) yields Lagrangian

(10)   $\mathcal{L} = \frac{1}{n}\mathrm{tr}\left(B^\top L (L + n\varepsilon I_n)^{-1} K^2 B\right) - \mathrm{tr}\left(\left(B^\top K Q K B + B^\top K B - I_m\right) \Gamma\right)$ ,

where $\Gamma$ is a diagonal matrix containing the Lagrange multipliers. Setting the
derivative of (10) w.r.t. B to zero yields the generalized eigenvalue problem:

(11)   $\frac{1}{n} L (L + n\varepsilon I_n)^{-1} K^2 B = (K Q K + K) B \Gamma$ .

Transformation B corresponds to the m leading eigenvectors of the generalized
eigenvalue problem (11).$^1$

The inverse regression framework based on covariance operators has two benefits. 
First, it avoids explicitly slicing the output space, which makes it suitable for high- 
dimensional output. Second, it allows for structured outputs on which explicit 
slicing may be impossible, e.g., trees and sequences. Since our framework is based 
entirely on kernels, it is applicable to any type of input and output variables, as 
long as the corresponding kernels can be defined. 

2.4. Unsupervised DICA. In some application domains, such as image denois- 
ing, information about the target may not be available. We therefore derive an 
unsupervised version of DICA. Instead of preserving the central subspace, unsu- 
pervised DICA (UDICA) maximizes the variance of X in the feature space, which 
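is estimated as $\frac{1}{n}\mathrm{tr}(B^\top K^2 B)$. Thus, UDICA solves

(12)   $\max_{B \in \mathbb{R}^{n \times m}} \dfrac{\frac{1}{n}\mathrm{tr}(B^\top K^2 B)}{\mathrm{tr}(B^\top K Q K B + B^\top K B)}$ .

Similar to DICA, the solution of (12) is obtained by solving the generalized eigen-
value problem

(13)   $\frac{1}{n} K^2 B = (K Q K + K) B \Gamma$ .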



$^1$In practice, it is more numerically stable to solve the generalized eigenvalue problem $\frac{1}{n} L (L +
n\varepsilon I_n)^{-1} K^2 B = (K Q K + K + \lambda I) B \Gamma$, where $\lambda$ is a small constant.
Algorithm 1 Domain-Invariant Component Analysis

Input: Parameters $\lambda$, $\varepsilon$, and $m \ll n$. Sample $\mathcal{S} = \{S^i = \{(x_k^{(i)}, y_k^{(i)})\}_{k=1}^{n_i}\}_{i=1}^{N}$.
Output: Projection $B_{n \times m}$ and kernel $\tilde{K}_{n \times n}$.

1: Calculate gram matrix $[K_{ij}]_{kl} = k(x_k^{(i)}, x_l^{(j)})$ and $[L_{ij}]_{kl} = l(y_k^{(i)}, y_l^{(j)})$.

2: Supervised: $C = L (L + n\varepsilon I)^{-1} K^2$.

3: Unsupervised: $C = K^2$.

4: Solve $\frac{1}{n} C B = (K Q K + K + \lambda I) B \Gamma$ for B.

5: Output B and $\tilde{K} \leftarrow K B B^\top K$.

6: The test kernel $\tilde{K}^t \leftarrow K^t B B^\top K$ where $K^t_{n_t \times n}$ is the joint kernel between test
and training data.



UDICA is a special case of DICA where $L = \frac{1}{n} I$ and $\varepsilon \to 0$. Algorithm 1 summarizes
supervised and unsupervised domain-invariant component analysis.
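
The steps of Algorithm 1 can be sketched with NumPy and SciPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `dica` and `project_test` are hypothetical, the kernel matrices `K` and `L` are assumed to be precomputed (and centered, if zero-mean feature maps are required), and a general-purpose solver (`scipy.linalg.eig`) is used because the left-hand matrix in the supervised case is not symmetric. Keeping only the real parts of the eigenpairs is a further assumption that the leading eigenpairs are real in practice.

```python
import numpy as np
from scipy.linalg import eig

def dica(K, Q, L=None, m=2, lam=0.1, eps=1e-4):
    """Sketch of Algorithm 1. L=None selects the unsupervised variant (UDICA).

    Returns the coefficient matrix B (n x m), the m leading generalized
    eigenvalues, and the transformed training kernel K B B^T K.
    """
    n = K.shape[0]
    I = np.eye(n)
    K2 = K @ K
    if L is None:
        C = K2                                          # step 3: unsupervised
    else:
        C = L @ np.linalg.solve(L + n * eps * I, K2)    # step 2: L (L + n eps I)^{-1} K^2
    lhs = C / n
    rhs = K @ Q @ K + K + lam * I                       # regularized right-hand side
    gammas, vecs = eig(lhs, rhs)                        # step 4: lhs B = rhs B Gamma
    order = np.argsort(-gammas.real)[:m]                # m leading eigenvalues
    B = vecs[:, order].real
    K_tilde = K @ B @ B.T @ K                           # step 5
    return B, gammas[order].real, K_tilde

def project_test(K_test, K, B):
    """Step 6: K_test is the (n_t x n) kernel between test and training points."""
    return K_test @ B @ B.T @ K
```

In the unsupervised case both sides of the pencil are symmetric and the right-hand side is positive definite, so a symmetric solver such as `scipy.linalg.eigh(lhs, rhs)` would be a reasonable alternative there.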

2.5. Relations to Other Methods. The DICA and UDICA algorithms gen-
eralize many well-known dimension reduction techniques. In the supervised set-
ting, if dataset $\mathcal{S}$ contains samples drawn from a single distribution $\mathbb{P}_{XY}$ then we
have $KQK = 0$. Substituting $\alpha := KB$ gives the eigenvalue problem $\frac{1}{n} L (L +
n\varepsilon I)^{-1} K \alpha = \alpha \Gamma$, which corresponds to covariance operator inverse regression
(COIR) (Kim and Pavlovic, 2011).

If there is only a single distribution then unsupervised DICA reduces to KPCA
since $KQK = 0$ and finding B requires solving the eigensystem $KB = B\Gamma$ which
recovers KPCA (Schölkopf et al., 1998). If there are two domains, source $\mathbb{P}_S$ and
target $\mathbb{P}_T$, then UDICA is closely related, though not identical, to Transfer
Component Analysis (Pan et al., 2011). This follows from the observation that
$\mathbb{V}_{\mathcal{H}}(\{\mathbb{P}_S, \mathbb{P}_T\}) = \frac{1}{4}\|\mu_{\mathbb{P}_S} - \mu_{\mathbb{P}_T}\|^2$, see proof of Theorem 1.
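
The two-domain identity can be verified directly. The following worked derivation assumes the form of the distributional variance as an average squared distance to the mean embedding, $\frac{1}{N}\sum_i \|\mu_{\mathbb{P}^i} - \bar\mu\|^2_{\mathcal{H}}$ with $\bar\mu = \frac{1}{N}\sum_i \mu_{\mathbb{P}^i}$, and sets $N = 2$:

```latex
\begin{align*}
\bar\mu &= \tfrac{1}{2}\left(\mu_{\mathbb{P}_S} + \mu_{\mathbb{P}_T}\right), \\
\mathbb{V}_{\mathcal H}(\{\mathbb{P}_S,\mathbb{P}_T\})
 &= \tfrac{1}{2}\left(\|\mu_{\mathbb{P}_S}-\bar\mu\|_{\mathcal H}^2
                    + \|\mu_{\mathbb{P}_T}-\bar\mu\|_{\mathcal H}^2\right) \\
 &= \tfrac{1}{2}\cdot 2\cdot
    \left\|\tfrac{1}{2}\left(\mu_{\mathbb{P}_S}-\mu_{\mathbb{P}_T}\right)\right\|_{\mathcal H}^2
  = \tfrac{1}{4}\,\|\mu_{\mathbb{P}_S}-\mu_{\mathbb{P}_T}\|_{\mathcal H}^2 .
\end{align*}
```

The right-hand side is one quarter of the squared maximum mean discrepancy between source and target, the quantity that TCA reduces, which is why the two methods are closely related in this special case.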

2.6. A Learning-Theoretic Bound. We bound the generalization error of a clas-
sifier trained after DICA-preprocessing. The main complication is that samples are
not identically distributed. We adapt an approach to this problem developed in
Blanchard et al. (2011) to prove a generalization bound that applies after trans-
forming the empirical sample using $\mathcal{B}$. Recall that $\mathcal{B} = \Phi_x B$.

Define kernel $\bar{k}$ on $\mathfrak{P}_{\mathcal{X}} \times \mathcal{X}$ as $\bar{k}((\mathbb{P}, x), (\mathbb{P}', x')) := k_{\mathfrak{P}}(\mathbb{P}, \mathbb{P}') \cdot k_{\mathcal{X}}(x, x')$. Here, $k_{\mathcal{X}}$ is
the kernel on $\mathcal{H}_{\mathcal{X}}$ and the kernel on distributions is $k_{\mathfrak{P}}(\mathbb{P}, \mathbb{P}') := \kappa(\mu_{\mathbb{P}}, \mu_{\mathbb{P}'})$ where $\kappa$
is a positive definite kernel (Christmann and Steinwart, 2010; Muandet et al., 2012).
Let $\Psi_{\mathfrak{P}}$ denote the corresponding feature map.



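Theorem 5. Under reasonable technical assumptions, see Supplementary, it holds
with probability at least $1 - \delta$ that,

$\sup_{\|f\|_{\mathcal{H}} \le 1} \left| \mathbb{E}^*_{\mathscr{P}} \mathbb{E}_{\mathbb{P}}\, \ell(f(\tilde{X}_{ij} \mathcal{B}), Y_i) - \mathbb{E}_{\hat{\mathbb{P}}}\, \ell(f(\tilde{X}_{ij} \mathcal{B}), Y_i) \right|^2$
$\quad \le c_1 \frac{1}{N}\mathrm{tr}(B^\top K Q K B) + \mathrm{tr}(B^\top K B) \left( c_2 \dfrac{N (\log \delta^{-1} + 2 \log N)}{n} + \dfrac{c_3 \log \delta^{-1} + c_4}{N} \right)$ .

The LHS is the difference between the training error and expected error (with
respect to the distribution on domains $\mathscr{P}^*$) after applying $\mathcal{B}$.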

The first term in the bound, involving $\mathrm{tr}(B^\top K Q K B)$, quantifies the distribu-
tional variance after applying the transform: the higher the distributional vari-
ance, the worse the guarantee, tying in with analogous results in Ben-David et al.
(2007; 2010). The second term in the bound depends on the size of the distortion
$\mathrm{tr}(B^\top K B)$ introduced by $\mathcal{B}$: the more complicated the transform, the worse the
guarantee.
The bound reveals a tradeoff between reducing the distributional variance and
the complexity or size of the transform used to do so. The denominator of (9) is a
sum of these terms, so that DICA tightens the bound in Theorem 5.

Preserving the functional relationship (i.e. central subspace) by maximizing the
numerator in (9) should reduce the empirical risk $\mathbb{E}_{\hat{\mathbb{P}}}\, \ell(f(\tilde{X}_{ij} \mathcal{B}), Y_i)$. However, a
rigorous demonstration has yet to be found.

3. Experiments 

We illustrate the difference between the proposed algorithms and their single- 
domain counterparts using a synthetic dataset. Furthermore, we evaluate DICA in 
two tasks: a classification task on flow cytometry data and a regression task for 
Parkinson's telemonitoring. 

3.1. Toy Experiments. We generate 10 collections of $n_i \sim \mathrm{Poisson}(200)$ data
points. The data in each collection is generated according to a five-dimensional zero-
mean Gaussian distribution. For each collection, the covariance of the distribution
is generated from Wishart distribution $\mathcal{W}(0.2 \times I_5, 10)$. This step is to simulate
different marginal distributions. The output value is $y = \mathrm{sign}(b_1^\top x + \epsilon_1) \cdot \log(|b_2^\top x +
c + \epsilon_2|)$, where $b_1$, $b_2$ are the weight vectors, $c$ is a constant, and $\epsilon_1, \epsilon_2 \sim \mathcal{N}(0, 1)$.
Note that $b_1$ and $b_2$ form a low-dimensional subspace that captures the functional
relationship between X and Y. We then apply the KPCA, UDICA, COIR, and
DICA algorithms on the dataset with Gaussian RBF kernels for both X and Y
with bandwidth parameters $\sigma_x = \sigma_y = 1$, $\lambda = 0.1$, and $\varepsilon = 10^{-4}$.
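
A data generator following this description might look like the sketch below. The text does not specify how $b_1$, $b_2$, and $c$ are chosen, so drawing the weight vectors from a standard normal and fixing `c = 1.0` are assumptions made only for illustration; the function name `make_toy_domains` is likewise hypothetical.

```python
import numpy as np
from scipy.stats import wishart

def make_toy_domains(n_domains=10, dim=5, seed=0):
    """Generate a list of (x, y) pairs, one per domain, for the toy setup."""
    rng = np.random.default_rng(seed)
    b1 = rng.normal(size=dim)   # assumption: weight vectors shared by all domains
    b2 = rng.normal(size=dim)
    c = 1.0                     # assumption: the constant is not given in the text
    domains = []
    for _ in range(n_domains):
        n_i = max(int(rng.poisson(200)), 1)
        # Domain-specific covariance drawn from W(0.2 * I_dim, 10)
        cov = wishart.rvs(df=10, scale=0.2 * np.eye(dim), random_state=rng)
        x = rng.multivariate_normal(np.zeros(dim), cov, size=n_i)
        e1 = rng.normal(size=n_i)
        e2 = rng.normal(size=n_i)
        y = np.sign(x @ b1 + e1) * np.log(np.abs(x @ b2 + c + e2))
        domains.append((x, y))
    return domains
```

Because only the covariance changes from one collection to the next, the marginals differ across domains while the rule that maps x to y stays fixed, which is the setting DICA targets.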

Fig. 2 shows projections of the training and three previously unseen test datasets
onto the first two eigenvectors. The subspaces obtained from UDICA and DICA 
are more stable than for KPCA and COIR. In particular, COIR shows a substantial 
difference between training and test data, suggesting overfitting. 

3.2. Gating of Flow Cytometry Data. Graft-versus-host disease (GvHD) oc- 
curs in allogeneic hematopoietic stem cell transplant recipients when donor-immune 
cells in the graft recognize the recipient as "foreign" and initiate an attack on the 
skin, gut, liver, and other tissues. It is a significant clinical problem in the field 
of allogeneic blood and marrow transplantation. The GvHD dataset (Brinkman
et al., 2007) consists of weekly peripheral blood samples obtained from 31 patients
following allogeneic blood and marrow transplant. The goal of gating is to identify
CD3+CD4+CD8β+ cells, which were found to have a high correlation with the
development of GvHD (Brinkman et al., 2007). We expect to find a subspace of
cells that is consistent across the biological variation between patients, and is indica-
tive of the GvHD development. For each patient, we select a dataset that contains
sufficient numbers of the target cell populations. As a result, we omit one patient
due to insufficient data. The corresponding flow cytometry datasets from 30 pa-
tients have sample sizes ranging from 1,000 to 10,000, and the proportion of the
CD3+CD4+CD8β+ cells in each dataset ranges from 10% to 30%, depending on
the development of the GvHD. 

To evaluate the performance of the proposed algorithms, we took data from N =
10 patients for training, and the remaining 20 patients for testing. We subsample
the training sets and test sets to have 100, 500, and 1,000 data points (cells) each.
We compare the SVM classifiers under two settings, namely, a pooling SVM and
a distributional SVM. The pooling SVM disregards the inter-patient variation by
combining all datasets from different patients, whereas the distributional SVM also
takes the inter-patient variation into account via the kernel function (Blanchard
et al., 2011)

(14)   $K(x_k^{(i)}, x_l^{(j)}) = k_1(\mathbb{P}^i, \mathbb{P}^j) \cdot k_2(x_k^{(i)}, x_l^{(j)})$
Figure 2. Projections of a synthetic dataset onto the first two
eigenvectors obtained from the KPCA, UDICA, COIR, and DICA.
The colors of data points correspond to the output values. The
shaded boxes depict the projection of training data, whereas the 
unshaded boxes show projections of unseen test datasets. The fea- 
ture representations learnt by UDICA and DICA are more stable 
across test domains than those learnt by KPCA and COIR. 

Table 1. Average accuracies over 30 random subsamples of GvHD
datasets. Pooling SVM applies standard kernel function on the
pooled data from multiple domains, whereas distributional SVM
also considers similarity between domains using kernel (14). With
sufficiently many samples, DICA outperforms other methods in
both pooling and distributional settings. The performance of pool-
ing SVM and distributional SVM are comparable in this case.

Methods   Pooling SVM                                 Distributional SVM
          n_i = 100    n_i = 500    n_i = 1000        n_i = 100    n_i = 500    n_i = 1000
Input     91.68±.91    92.11±1.14   93.57±.77         91.53±.76    92.81±.93    92.41±.98
KPCA      91.65±.93    92.06±1.15   93.59±.77         91.83±.60    90.86±1.98   92.61±1.12
COIR      91.71±.88    92.00±1.05   92.57±.97         91.42±.95    91.54±1.14   92.61±.89
UDICA     91.20±.81    92.21±.19    93.02±.77         91.51±.79    91.74±1.08   93.02±.77
DICA      91.37±.91    92.71±.82    94.16±.73         91.51±.89    93.42±.73    93.33±.86


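where $k_1$ is the kernel on distributions. We use $k_1(\mathbb{P}^i, \mathbb{P}^j) =
\exp(-\|\mu_{\mathbb{P}^i} - \mu_{\mathbb{P}^j}\|^2_{\mathcal{H}} / 2\sigma_1^2)$ and $k_2(x_k^{(i)}, x_l^{(j)}) = \exp(-\|x_k^{(i)} - x_l^{(j)}\|^2 / 2\sigma_2^2)$, where $\mu_{\mathbb{P}^i}$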
Table 2. The average leave-one-out accuracies over 30 subjects
on GvHD data. The distributional SVM outperforms the pooling
SVM. DICA improves classifier accuracy.

Methods   Pooling SVM    Distributional SVM
Input     92.03±8.21     93.19±7.20
KPCA      91.99±9.02     93.11±6.83
COIR      92.40±8.63     92.92±8.20
UDICA     92.51±5.09     92.74±5.01
DICA      92.72±6.41     94.80±3.81


Table 3. Root mean square error (RMSE) of the independent
Gaussian Process regression (GPR) applied to the Parkinson's tele-
monitoring dataset. DICA outperforms other approaches in both
settings; and the distributional SVM outperforms the pooling
SVM.

Methods   Pooling GP Regression            Distributional GP Regression
          motor score    total score       motor score    total score
LLS       8.82 ± 0.77    11.80 ± 1.54      8.82 ± 0.77    11.80 ± 1.54
Input     9.58 ± 1.06    12.67 ± 1.40      8.57 ± 0.77    11.50 ± 1.56
KPCA      8.54 ± 0.89    11.20 ± 1.47      8.50 ± 0.87    11.22 ± 1.49
UDICA     8.67 ± 0.83    11.36 ± 1.43      8.75 ± 0.97    11.55 ± 1.52
COIR      9.25 ± 0.75    12.41 ± 1.63      9.23 ± 0.90    11.97 ± 2.09
DICA      8.40 ± 0.76    11.05 ± 1.50      8.35 ± 0.82    10.02 ± 1.01


is computed using the sample $S^i$. For pooling SVM, the kernel $k_1(\mathbb{P}^i, \mathbb{P}^j)$ is constant for any i
and j. Moreover, we use the output kernel $l(y_k^{(i)}, y_l^{(j)}) = \delta(y_k^{(i)}, y_l^{(j)})$ where $\delta(a, b)$ is
1 if a = b, and 0 otherwise. We compare the performance of the SVMs trained on the
preprocessed datasets using the KPCA, COIR, UDICA, and DICA algorithms. It is
important to note that we are not defining another kernel on top of the preprocessed
data. That is, the kernel $k_2$ for KPCA, COIR, UDICA, and DICA is exactly (5).
We perform 10-fold cross validation on the parameter grids to optimize for accuracy.
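
The product kernel (14) on raw inputs can be sketched as below. This is a hedged illustration, not the experimental code: the names `rbf` and `distributional_kernel` are hypothetical, and the kernel used to embed each sample as a mean element (bandwidth `sigma_emb`) is assumed to be Gaussian, since the text does not state which embedding kernel was used.

```python
import numpy as np

def rbf(X, Z, sigma):
    """Gaussian RBF kernel matrix between the rows of X and the rows of Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def distributional_kernel(samples, sigma1=1.0, sigma2=1.0, sigma_emb=1.0):
    """n x n kernel k1(P^i, P^j) * k2(x, x') over all pooled points."""
    N = len(samples)
    # Inner products between empirical mean embeddings of the N samples
    G = np.array([[rbf(a, b, sigma_emb).mean() for b in samples] for a in samples])
    d = np.diag(G)
    # Squared RKHS distance between mean embeddings
    dist2 = np.maximum(d[:, None] + d[None, :] - 2.0 * G, 0.0)
    k1 = np.exp(-dist2 / (2.0 * sigma1**2))              # N x N kernel on distributions
    X = np.vstack(samples)
    dom = np.repeat(np.arange(N), [len(s) for s in samples])
    k2 = rbf(X, X, sigma2)                               # n x n kernel on points
    return k1[np.ix_(dom, dom)] * k2                     # expand k1 to point level
```

Letting `sigma1` grow very large makes `k1` approximately constant, which recovers the pooling setting described in the text.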

Table 1 reports average accuracies and their standard deviation over 30 repe-
titions of the experiments. For sufficiently large number of samples, DICA out-
performs other approaches. The pooling SVM and distributional SVM achieve
comparable accuracies. The average leave-one-out accuracies over 30 subjects are
reported in Table 2 (see supplementary for more detail).

3.3. Parkinson's Telemonitoring. To evaluate DICA in a regression setting, we
apply it to a Parkinson's telemonitoring dataset.$^2$ The dataset consists of biomed-
ical voice measurements from 42 people with early-stage Parkinson's disease re-
cruited for a six-month trial of a telemonitoring device for remote symptom pro-
gression monitoring. The aim is to predict the clinician's motor and total UPDRS
scoring of Parkinson's disease symptoms from 16 voice measures. There are around
200 recordings per patient.

We adopt the same experimental settings as in §3.2, except that we employ
two independent Gaussian Process (GP) regressions to predict motor and total UP-
DRS scores. For COIR and DICA, we consider the output kernel $l(y_k^{(i)}, y_l^{(j)}) =
\exp(-\|y_k^{(i)} - y_l^{(j)}\|^2 / 2\sigma_3^2)$ to fully account for the affinity structure of the output

$^2$http://archive.ics.uci.edu/ml/datasets/Parkinson's+Telemonitoring
Figure 3. The root mean square error (RMSE) of motor and total
UPDRS scores predicted by GP regression after different prepro-
cessing methods on Parkinson's telemonitoring dataset. The top
and middle rows depict the pooling and distributional settings;
the bottom row compares the two settings. Results of linear least 
square (LLS) are given as a baseline. 



variable. We set $\sigma_3$ to be the median of motor and total UPDRS scores. The voice
measurements from 30 patients are used for training and the rest for testing.

Fig. 3 depicts the results. DICA consistently, though not statistically signifi-
cantly, outperforms other approaches, see Table 3. Inter-patient (i.e. across do-
main) variation worsens prediction accuracy on new patients. Reducing this varia-
tion with DICA improves the accuracy on new patients. Moreover, incorporating
the inter-subject variation via distributional GP regression further improves the
generalization ability, see Fig. 3.

4. Conclusion and Discussion 

To conclude, we proposed a simple algorithm called Domain-Invariant Compo- 
nent Analysis (DICA) for learning an invariant transformation of the data which 
has proven significant for domain generalization both theoretically and empirically. 
Theorem 5 shows the generalization error on previously unseen domains grows
with the distributional variance. We also showed that DICA generalizes KPCA 
and COIR, and is closely related to TCA. Finally, experimental results on both 
synthetic and real-world datasets show DICA performs well in practice. Interest- 
ingly, the results also suggest that the distributional SVM, which takes into account 
inter-domain variation, outperforms the pooling SVM which ignores it. 
The motivating assumption in this work is that the functional relationship is
stable or varies smoothly across domains. This is a reasonable assumption for au-
tomatic gating of flow cytometry data because the inter-subject variation of cell
population makes it impossible for a domain expert to apply the same gating on all
subjects, and similarly makes sense for Parkinson's telemonitoring data. Never-
theless, the assumption does not hold in many applications where the conditional
distributions are substantially different. It remains unclear how to develop tech-
niques that generalize to previously unseen domains in these scenarios.

DICA can be adapted to novel applications by equipping the optimization prob- 
lem with appropriate constraints. For example, one can formulate a semi-supervised 
extension of DICA by forcing the invariant basis functions to lie on a manifold or 
preserve a neighborhood structure. Moreover, by incorporating the distributional 
variance as a regularizer in the objective function, the invariant features and clas- 
sifier can be optimized simultaneously. 

Acknowledgments. We thank Samory Kpotufe and Kun Zhang for fruitful discus- 
sions and the three anonymous reviewers for insightful comments and suggestions 
that significantly improved the paper.

Table 4. Comparison of domain generalization with other well-known
frameworks. Note that domain generalization is closely related to
multi-task learning and domain adaptation. The difference is that in
domain generalization one does not observe the target domains in which
a classifier will be applied without retraining the classifier.



Framework               Distribution Mismatch   Multiple Sources   Target Domain
Standard Setup          ✗                       ✗                  ✗
Transfer Learning       ✓                       ✗                  ✓
Multi-task Learning     ✓                       ✓                  ✗
Domain Adaptation       ✓                       ✓                  ✓
Domain Generalization   ✓                       ✓                  ✗



Appendix A. Domain Generalization and Related Frameworks 



The most fundamental assumption in machine learning is that the observations 
are independent and identically distributed (i.i.d.). That is, each observation comes 
from the same probability distribution as the others and all are mutually indepen- 
dent. However, this assumption is often violated in practice, in which case the 
standard machine learning algorithms do not perform well. In the past decades, 
many techniques have been proposed to tackle scenarios where there is a mismatch 
between training and test distributions. These include domain adaptation (Bickel
et al., 2009a), multitask learning (Caruana, 1997), transfer learning (Pan and Yang,
2010b), covariate/dataset shift (Quiñonero-Candela et al., 2009) and concept drift
(Widmer and Kubat, 1996). To better understand domain generalization, we briefly
discuss how it relates to some of these approaches.



A.1. Transfer learning (see e.g., Pan and Yang (2010b) and references
therein). Transfer learning aims at transferring knowledge from some previous
tasks to a target task when the latter has limited training data. That is, although
there may be few labeled examples, "knowledge" obtained in related tasks may
be available. Transfer learning focuses on improving the learning of the target
predictive function using the knowledge in the source task. Although not identical,
domain generalization can be viewed as transfer learning when knowledge of the
target task is unavailable during training.



A.2. Multitask learning (see e.g., Caruana (1997) and references therein).

The goal of multitask learning is to learn multiple tasks simultaneously - especially 
when training examples in each task are scarce. By learning all tasks simulta- 
neously, one expects to improve generalization on individual tasks. An important 
assumption is therefore that all the tasks are related. Multitask learning differs from 
domain generalization because learning the new task often requires retraining. 



A.3. Domain adaptation (see e.g., Bickel et al. (2009a) and references
therein). Domain adaptation, also known as covariate shift, deals primarily with a
mismatch between training and test distributions. Domain generalization deals with 
a broader setting where training instances may have been collected from multiple 
source domains. A second difference is that in domain adaptation one observes the 
target domain during the training time whereas in domain generalization one does 
not. 

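Table [4] summarizes the main differences between the various frameworks.

Appendix B. Proof of Theorem [1]

Lemma 6. Given a set of distributions $\mathcal{P} = \{\mathbb{P}^1, \mathbb{P}^2, \ldots, \mathbb{P}^N\}$, the distributional
variance of $\mathcal{P}$ is

\[
\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = \frac{1}{N} \sum_{i=1}^{N} \left\| \mu_{\mathbb{P}^i} - \mu_{\bar{\mathbb{P}}} \right\|_{\mathcal{H}}^2 ,
\]

where $\mu_{\bar{\mathbb{P}}} = \frac{1}{N} \sum_{i=1}^{N} \mu_{\mathbb{P}^i}$ and $\bar{\mathbb{P}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{P}^i$.

Proof. Let $\bar{\mathbb{P}}$ be the probability distribution defined as $\frac{1}{N} \sum_{i=1}^{N} \mathbb{P}^i$, i.e.,
$\bar{\mathbb{P}}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{P}^i(x)$. It follows from the linearity of the expectation that
$\mu_{\bar{\mathbb{P}}} = \frac{1}{N} \sum_{i=1}^{N} \mu_{\mathbb{P}^i}$. For brevity, we will denote $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ by $\langle \cdot, \cdot \rangle$. Then, expanding ([3])
gives

\[
\begin{aligned}
\mathbb{V}_{\mathcal{H}}(\mathcal{P}) &= \frac{1}{N} \mathrm{tr}(\Sigma) = \frac{1}{N} \mathrm{tr}(G) - \frac{1}{N^2} \sum_{i,j=1}^{N} G_{ij} \\
&= \frac{1}{N} \left[ \sum_{i=1}^{N} \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^i} \rangle - \frac{1}{N} \sum_{i,j=1}^{N} \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^j} \rangle \right] \\
&= \frac{1}{N} \left[ \sum_{i=1}^{N} \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^i} \rangle - \frac{2}{N} \sum_{i,j=1}^{N} \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^j} \rangle + \frac{1}{N} \sum_{i,j=1}^{N} \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^j} \rangle \right] \\
&= \frac{1}{N} \left[ \sum_{i=1}^{N} \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^i} \rangle - 2 \sum_{i=1}^{N} \Big\langle \mu_{\mathbb{P}^i}, \frac{1}{N} \sum_{j=1}^{N} \mu_{\mathbb{P}^j} \Big\rangle + N \Big\langle \frac{1}{N} \sum_{i=1}^{N} \mu_{\mathbb{P}^i}, \frac{1}{N} \sum_{j=1}^{N} \mu_{\mathbb{P}^j} \Big\rangle \right] \\
&= \frac{1}{N} \sum_{i=1}^{N} \Big( \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^i} \rangle - 2 \langle \mu_{\mathbb{P}^i}, \mu_{\bar{\mathbb{P}}} \rangle + \langle \mu_{\bar{\mathbb{P}}}, \mu_{\bar{\mathbb{P}}} \rangle \Big) \\
&= \frac{1}{N} \sum_{i=1}^{N} \left\| \mu_{\mathbb{P}^i} - \mu_{\bar{\mathbb{P}}} \right\|_{\mathcal{H}}^2 ,
\end{aligned}
\]

which completes the proof. □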



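Theorem [1]. For a characteristic kernel $k$, $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = 0$ if and only if
$\mathbb{P}^1 = \mathbb{P}^2 = \cdots = \mathbb{P}^N$.

Proof. Since $k$ is characteristic, $\| \mu_{\mathbb{P}} - \mu_{\mathbb{Q}} \|_{\mathcal{H}}$ is a metric and is zero iff $\mathbb{P} = \mathbb{Q}$ for
any distributions $\mathbb{P}$ and $\mathbb{Q}$ (Sriperumbudur et al., 2010). By Lemma [6],
$\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = \frac{1}{N} \sum_{i=1}^{N} \| \mu_{\mathbb{P}^i} - \mu_{\bar{\mathbb{P}}} \|_{\mathcal{H}}^2$. Thus, $\| \mu_{\mathbb{P}^i} - \mu_{\bar{\mathbb{P}}} \|_{\mathcal{H}} = 0$ iff $\mathbb{P}^i = \bar{\mathbb{P}}$.
Consequently, if $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$ is zero, this implies that $\mathbb{P}^i = \bar{\mathbb{P}}$ for all $i$, meaning that
$\mathbb{P}^1 = \cdots = \mathbb{P}^N$. Conversely, if $\mathbb{P}^1 = \cdots = \mathbb{P}^N$, then $\| \mu_{\mathbb{P}^i} - \mu_{\bar{\mathbb{P}}} \|_{\mathcal{H}}^2$ is zero for all $i$
and thereby $\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = \frac{1}{N} \sum_{i=1}^{N} \| \mu_{\mathbb{P}^i} - \mu_{\bar{\mathbb{P}}} \|_{\mathcal{H}}^2$ is zero. □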
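The identity in Lemma 6 can be checked numerically. The following is a minimal sketch, assuming a finite-dimensional stand-in in which each mean embedding is an explicit vector (the array `M` and both function names are hypothetical and not part of the paper). It compares the trace form of the distributional variance with the average squared distance to the mean embedding, and confirms that the variance vanishes when all embeddings coincide, as in Theorem [1].

```python
import numpy as np

def distributional_variance_from_gram(G):
    """Trace form: (1/N) tr(G) - (1/N^2) sum_ij G_ij."""
    N = G.shape[0]
    return np.trace(G) / N - G.sum() / N**2

def distributional_variance_from_means(M):
    """Lemma 6 form: (1/N) sum_i ||mu_i - mu_bar||^2 (rows of M are embeddings)."""
    mu_bar = M.mean(axis=0)
    return np.mean(np.sum((M - mu_bar) ** 2, axis=1))

rng = np.random.default_rng(0)

# Hypothetical setup: 5 domains, each with a 3-dimensional explicit mean embedding.
M = rng.normal(size=(5, 3))
G = M @ M.T  # G_ij = <mu_i, mu_j>

v_gram = distributional_variance_from_gram(G)
v_means = distributional_variance_from_means(M)

# Identical embeddings for every domain: the variance should vanish.
M_same = np.tile(M[0], (5, 1))
v_same = distributional_variance_from_gram(M_same @ M_same.T)

print(v_gram, v_means, v_same)
```

The two forms agree up to floating-point error, which is the content of the lemma in this simplified setting.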



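Appendix C. Proof of Theorem [2]

Theorem [2]. The empirical estimator $\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{S}) = \frac{1}{N} \mathrm{tr}(\hat{\Sigma}) = \mathrm{tr}(KQ)$ obtained from
Gram matrix

\[
\hat{G}_{ij} := \frac{1}{n_i \cdot n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} k\big(x_k^{(i)}, x_l^{(j)}\big)
\]

is a consistent estimator of $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$.

Proof. Recall that

\[
\mathbb{V}_{\mathcal{H}}(\mathcal{P}) = \frac{1}{N} \mathrm{tr}(G) - \frac{1}{N^2} \sum_{i,j=1}^{N} G_{ij}
\quad \text{and} \quad
\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{S}) = \frac{1}{N} \mathrm{tr}(\hat{G}) - \frac{1}{N^2} \sum_{i,j=1}^{N} \hat{G}_{ij} ,
\]

where

\[
G_{ij} = \langle \mu_{\mathbb{P}^i}, \mu_{\mathbb{P}^j} \rangle_{\mathcal{H}}
\quad \text{and} \quad
\hat{G}_{ij} = \langle \hat{\mu}_{\mathbb{P}^i}, \hat{\mu}_{\mathbb{P}^j} \rangle_{\mathcal{H}} = \frac{1}{n_i \cdot n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} k\big(x_k^{(i)}, x_l^{(j)}\big) .
\]

By Theorem 15 in Altun and Smola (2006), we have a fast convergence of $\hat{\mu}_{\mathbb{P}}$ to
$\mu_{\mathbb{P}}$. Consequently, we have $\hat{G} \to G$, which implies that $\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{S}) \to \mathbb{V}_{\mathcal{H}}(\mathcal{P})$. Hence,
$\hat{\mathbb{V}}_{\mathcal{H}}(\mathcal{S})$ is a consistent estimator of $\mathbb{V}_{\mathcal{H}}(\mathcal{P})$. □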
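As an illustration of the estimator in Theorem [2], the sketch below builds the empirical Gram matrix from samples and evaluates the distributional variance. It assumes a Gaussian RBF kernel with bandwidth `sigma` and synthetic one-dimensional Gaussian domains; these choices, and the function names, are assumptions made for the example rather than the paper's experimental setup.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def empirical_distributional_variance(samples, sigma=1.0):
    """(1/N) tr(G_hat) - (1/N^2) sum_ij G_hat_ij, where G_hat_ij is the
    average kernel value between the samples of domain i and domain j."""
    N = len(samples)
    G_hat = np.empty((N, N))
    for i, S_i in enumerate(samples):
        for j, S_j in enumerate(samples):
            G_hat[i, j] = rbf_kernel(S_i, S_j, sigma).mean()
    return np.trace(G_hat) / N - G_hat.sum() / N**2

rng = np.random.default_rng(0)

# Four domains drawn from the same distribution versus four shifted domains.
same = [rng.normal(0.0, 1.0, size=(200, 1)) for _ in range(4)]
shifted = [rng.normal(3.0 * i, 1.0, size=(200, 1)) for i in range(4)]

v_same = empirical_distributional_variance(same)
v_shifted = empirical_distributional_variance(shifted)
print(v_same, v_shifted)
```

With identically distributed domains the estimate is small and shrinks as the per-domain sample size grows, whereas shifted domains give a clearly larger value.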

Appendix D. Derivation of Eq. ([8])

DICA employs the covariance of the inverse regressor $\mathbb{V}(\mathbb{E}[\phi(X)|Y])$, which can be
written in terms of covariance operators. Let $\mathcal{H}$ and $\mathcal{F}$ be the RKHSes of $X$ and
$Y$ endowed with reproducing kernels $k$ and $l$, respectively. Let $\Sigma_{xx}$, $\Sigma_{yy}$, $\Sigma_{xy}$, and
$\Sigma_{yx}$ be the covariance operators in and between the corresponding RKHSes of $X$
and $Y$. We define the conditional covariance operator of $X$ given $Y$, denoted by
$\Sigma_{xx|y}$, as

\[
\Sigma_{xx|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} . \tag{15}
\]

The following theorem from Fukumizu et al. (2004) states that, under mild
conditions, $\Sigma_{xx|y}$ equals the expected conditional variance of $\phi(X)$ given $Y$.

Theorem 7. For any $f \in \mathcal{H}$, if there exists $g \in \mathcal{F}$ such that $\mathbb{E}[f(X)|Y] = g(Y)$ for
almost every $Y$, then $\Sigma_{xx|y} = \mathbb{E}[\mathbb{V}(\phi(X)|Y)]$.

Using the E-V-V-E identity³, the covariance $\mathbb{V}(\mathbb{E}[\phi(X)|Y])$ can be expressed in
terms of the conditional covariance operators as follows:

\[
\mathbb{V}(\mathbb{E}[\phi(X)|Y]) = \mathbb{V}(\phi(X)) - \mathbb{E}[\mathbb{V}(\phi(X)|Y)] , \tag{16}
\]

assuming that the inverse regressor $\mathbb{E}[f(x)|y]$ is a smooth function of $y$ for any
$f \in \mathcal{H}$.

By virtue of Theorem [7], the second term in the r.h.s. of ([16]) is $\Sigma_{xx|y}$. Since
$\mathbb{V}(\phi(X)) = \mathrm{Cov}(\phi(x), \phi(x)) = \Sigma_{xx}$, it follows from ([15]) that the covariance of the
inverse regression $\mathbb{V}(\mathbb{E}[\phi(X)|Y])$ can be expressed as

\[
\mathbb{V}(\mathbb{E}[\phi(X)|Y]) = \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} . \tag{17}
\]

The covariance ([17]) can be estimated from finite samples $(x_1, y_1), \ldots, (x_n, y_n)$
by $\hat{\mathbb{V}}(\mathbb{E}[\phi(X)|Y]) = \hat{\Sigma}_{xy} \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{yx}$, where $\hat{\Sigma}_{xy} = \frac{1}{n} \Phi_x \Phi_y^\top$, $\Phi_x = [\phi(x_1), \ldots, \phi(x_n)]$
and $\Phi_y = [\varphi(y_1), \ldots, \varphi(y_n)]$. Let $K$ and $L$ denote the kernel matrices computed
over samples $\{x_1, x_2, \ldots, x_n\}$ and $\{y_1, y_2, \ldots, y_n\}$, respectively. We have

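\[
\begin{aligned}
\hat{\mathbb{V}}(\mathbb{E}[\phi(X)|Y]) &= \Big( \frac{1}{n} \Phi_x \Phi_y^\top \Big) \Big( \frac{1}{n} \big( \Phi_y \Phi_y^\top + n \varepsilon \mathcal{I} \big) \Big)^{-1} \Big( \frac{1}{n} \Phi_y \Phi_x^\top \Big) \\
&= \frac{1}{n} \Phi_x L (L + n \varepsilon I_n)^{-1} \Phi_x^\top ,
\end{aligned} \tag{18}
\]

where $L = \Phi_y^\top \Phi_y$ and $\mathcal{I}$ is the identity operator. The second equation is obtained
by applying the fact that $(\Phi_y \Phi_y^\top + n \varepsilon \mathcal{I}) \Phi_y = \Phi_y (\Phi_y^\top \Phi_y + n \varepsilon I_n)$.

Finally, using $\hat{\Sigma}_{xx} = \frac{1}{n} \Phi_x \Phi_x^\top$ and recalling that $K = \Phi_x^\top \Phi_x$, we obtain, writing
each basis function as $b_k = \Phi_x \beta_k$,

\[
\begin{aligned}
b_k^\top \hat{\Sigma}_{xx}^{-1} \hat{\mathbb{V}}(\mathbb{E}[\phi(X)|Y]) \hat{\Sigma}_{xx} b_k
&= \beta_k^\top \Phi_x^\top \Big( \frac{1}{n} \Phi_x \Phi_x^\top \Big)^{-1} \Big( \frac{1}{n} \Phi_x L (L + n \varepsilon I_n)^{-1} \Phi_x^\top \Big) \Big( \frac{1}{n} \Phi_x \Phi_x^\top \Big) \Phi_x \beta_k \\
&= \frac{1}{n} \beta_k^\top L (L + n \varepsilon I_n)^{-1} K^2 \beta_k
\end{aligned}
\]

and

\[
b_k^\top b_k = \beta_k^\top K \beta_k ,
\]

as desired.

³ $\mathbb{V}(X) = \mathbb{E}[\mathbb{V}(X|Y)] + \mathbb{V}(\mathbb{E}[X|Y])$ for any $X, Y$.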
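The matrix manipulation behind (18) can be sanity-checked numerically. The sketch below is only an illustration under simplifying assumptions: the RKHS feature maps are replaced by small, randomly generated finite-dimensional feature matrices (`Phi_x`, `Phi_y`, and the dimensions are hypothetical), so that the operators become ordinary matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy, eps = 6, 4, 3, 0.1

# Hypothetical finite-dimensional stand-ins for the feature matrices:
# column i of Phi_x plays the role of phi(x_i), column i of Phi_y of varphi(y_i).
Phi_x = rng.normal(size=(dx, n))
Phi_y = rng.normal(size=(dy, n))
L = Phi_y.T @ Phi_y
I_n = np.eye(n)

# First line of the display leading to (18), with the regularized inverse.
S_xy = Phi_x @ Phi_y.T / n
S_yy_reg = (Phi_y @ Phi_y.T + n * eps * np.eye(dy)) / n
lhs = S_xy @ np.linalg.inv(S_yy_reg) @ S_xy.T

# Right-hand side of (18), expressed through the n-by-n matrix L.
rhs = Phi_x @ L @ np.linalg.inv(L + n * eps * I_n) @ Phi_x.T / n

# The fact used to move between the two forms.
push_left = (Phi_y @ Phi_y.T + n * eps * np.eye(dy)) @ Phi_y
push_right = Phi_y @ (L + n * eps * I_n)

print(np.allclose(lhs, rhs), np.allclose(push_left, push_right))
```

Both checks hold for any feature matrices and any positive regularizer, since the identity is purely algebraic.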

Appendix E. Derivation of Lagrangian JT 
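Observe that optimization

\[
\max_{B} \; \frac{\mathrm{tr}(B^\top X B)}{\mathrm{tr}(B^\top Y B)} \tag{19}
\]

is invariant to rescaling $B \mapsto \alpha \cdot B$. Optimization ([19]) is therefore equivalent to

\[
\max_{B} \; \mathrm{tr}(B^\top X B) \quad \text{subject to: } \mathrm{tr}(B^\top Y B) = 1 ,
\]

which yields Lagrangian

\[
\mathcal{L} = \mathrm{tr}(B^\top X B) - \mathrm{tr}\big( (B^\top Y B - I) \Gamma \big) . \tag{20}
\]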
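To illustrate the two observations above, the following sketch checks the scale invariance of the trace ratio and computes a stationary point of the Lagrangian (20). It assumes a symmetric positive semidefinite $X$, a symmetric positive definite $Y$, and a diagonal multiplier $\Gamma$, in which case setting the gradient of (20) to zero gives the generalized eigenproblem $XB = YB\Gamma$. The matrices below are random placeholders, not the DICA matrices, and the Cholesky whitening is one of several ways to solve the problem.

```python
import numpy as np

def trace_ratio(B, X, Y):
    """Objective (19): tr(B^T X B) / tr(B^T Y B)."""
    return np.trace(B.T @ X @ B) / np.trace(B.T @ Y @ B)

rng = np.random.default_rng(0)
n, m = 5, 2

# Hypothetical symmetric matrices: X is PSD, Y is positive definite.
A = rng.normal(size=(n, n))
X = A @ A.T
C = rng.normal(size=(n, n))
Y = C @ C.T + np.eye(n)
B = rng.normal(size=(n, m))

# Rescaling B leaves the ratio unchanged.
ratio = trace_ratio(B, X, Y)
ratio_scaled = trace_ratio(3.7 * B, X, Y)

# Solve X B = Y B Gamma by whitening with the Cholesky factor Y = R R^T.
R = np.linalg.cholesky(Y)
R_inv = np.linalg.inv(R)
evals, U = np.linalg.eigh(R_inv @ X @ R_inv.T)  # eigenvalues in ascending order
B_star = R_inv.T @ U[:, ::-1][:, :m]            # leading m generalized eigenvectors
Gamma = np.diag(evals[::-1][:m])

print(ratio, ratio_scaled)
```

The columns of `B_star` satisfy the stationarity condition and are normalized so that `B_star.T @ Y @ B_star` is the identity.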

Appendix F. Proof of Theorem [5]

We consider a scenario where distributions P* are drawn according to S?* with 
probability Introduce shorthand Xy for (pW,^-) for a distribution on 
and a corresponding random variable on X. 

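The quantity of interest is the difference between the expected and empirical loss
of a classifier $f : \mathfrak{P}_{\mathcal{X}} \times \mathcal{X} \to \mathcal{Y}$ under loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$.

Assumptions. The loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is $\phi_\ell$-Lipschitz in its first
variable and bounded by $U_\ell$. The kernel $k_{\mathcal{X}}$ is bounded by $U_{\mathcal{X}}$. Assume that all
distributions in $\mathfrak{P}_{\mathcal{X}}$ are mapped into a ball of size $U_{\mathfrak{P}}$ by $\Psi_{\mathfrak{P}}$. Finally, since $k_{\mathfrak{P}}$ is
a square exponential, there is a constant $L_{\mathfrak{P}}$ such that

\[
\| \Psi_{\mathfrak{P}}(v) - \Psi_{\mathfrak{P}}(w) \| \le L_{\mathfrak{P}} \| v - w \| \quad \text{for all } v, w .
\]

Recall that $N$ is the number of sampled domains, $n_i$ is the number of samples
in domain $i$, and $n = \sum_{i=1}^{N} n_i$ is the total number of samples. The proof assumes
$n_i = n_j$ for all $i, j$.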

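Theorem [5]. Assume the conditions above hold. Then with probability at least
$1 - \delta$,

\[
\begin{aligned}
\sup_{\|f\|_{\mathcal{H}} \le 1} & \Big| \mathbb{E}^*_{\mathscr{P}} \mathbb{E}_{\mathbb{P}} \, \ell(f(\tilde{X}_{ij} B), Y_i) - \mathbb{E}_{\hat{\mathbb{P}}} \, \ell(f(\tilde{X}_{ij} B), Y_i) \Big|^2 \\
& \le c_1 \frac{1}{N} \mathrm{tr}(B^\top K L K B) + \mathrm{tr}(B^\top K B) \cdot \left( c_2 \frac{N \cdot (\log \delta^{-1} + 2 \log N)}{n} + c_3 \frac{\log \delta^{-1}}{N} + \frac{c_4}{N} \right) .
\end{aligned}
\]

Remark 1. Recall that $\Phi_x = [\phi(x_1), \ldots, \phi(x_n)]$. The composition $x_t \mapsto k_t^\top \cdot B$,
where $k_t = [k(x_1, x_t), \ldots, k(x_n, x_t)]$, can therefore be rewritten as
$\phi(x_t)^\top \cdot \mathcal{B} = \phi(x_t)^\top \cdot \Phi_x \cdot B$.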



Proof. The proof modifies the approach taken in Blanchard et al. (2011) to handle
the preprocessing via transform $B$, and the fact that we work with squared errors.
Parts of the proof that pass through largely unchanged are omitted.

We repeatedly apply the inequality $|a + b|^2 \le 2|a|^2 + 2|b|^2$. However, we only incur
the multiplication-by-2 penalty once since $|a_1 + \cdots + a_n|^2 \le 2|a_1|^2 + \cdots + 2|a_n|^2$.

Decompose



sup 

II/IIm<i 



l%E v l{f{X l3 B), Y t ) - E f t{f(X l3 B), Y t ) 



A 



< sup — y lE^EriifiXijE), Y*) - E P d(f(X l3 B), Y t ) 

\\l\\u<l N ^ 



A 



sup -V E P ,^/(i, 3 i3),y ! )-E fl f(/(i !J B),y ! ) 

ll/llw^l^ 1 



t=i 

N 



ll/lk<i ^ T^i 1 
= (A) + (B) + (C) . 

Control of (C): 



A' 



(C)= sup - V ^(/(l^F*)- E^(/(X tf B),yi) 



A' 



<4>l sup E p ,/(5 ii B)-%/(XtfB) 
II/IIh<i ^ ~? 1 



n 



N 



i=l 



Note that ||*fp(/u(P))|| 2 < • \\fj, P \\ 2 < L-pKp. Therefore, 

n N 



(C) < 4> 2 e L v Uy— ||Mp^ - /^B|| 



By the proof of Theorem [1] and since $\Phi_x^\top \mathcal{B} = KB$, we have

\[
(C) \le 2 \phi_\ell^2 L_{\mathfrak{P}} U_{\mathfrak{P}} \frac{1}{N} \mathrm{tr}(K B B^\top K L) .
\]



Control of (B): Similarly, 



N 



(B) = sup - V \E P d(f(X tJ B), Y) - E f J(f(X 2J B), Y) 



< 2(/) 2 L v U v ■ — y ||mip<£ ~ Mp.^l) 2 
1 W 
Here we follow the strategy applied by Blanchard et al. (2011) to control their
term (I) in Theorem 5.1. Assume $n_i = n_j$ for all $i, j$ and recall $n = \sum_{i=1}^{N} n_i$, so
$n_i = n/N$ for all $i$.

By Hoeffding's inequality in Hilbert space, with probability greater than $1 - \delta$
the following inequality holds
2 



1 



< 9U X 



N ■ log 26~ 



Applying the union bound obtains 

(lb) < l&ftLyUyUx ■ \\B\\ 2 HS 

Control of (A): 



N ■ (log 5- 1 + 2 log N) 



N 



(A) = sup - E*^E P ^(/(A%-£), Yi) - E P ^(/(A\-£), 



||/lk<i N i=1 



Following the strategy used by Blanchard et al. (2011) to control (II) in Theorem
5.1, we obtain



N 



■\\B\ 



HS- 



End of proof: We have that $K$ is invertible since $\hat{\Sigma}_{xx}$ is assumed to be invertible.
It follows that the trace $\mathrm{tr}(B^\top K B)$ defines a norm which coincides with the
Hilbert-Schmidt norm $\|\mathcal{B}\|_{HS}^2$. Combining the three inequalities above concludes the proof.

□ 

Appendix G. Leave-one-out accuracy 

Figure [4] depicts the leave-one-out accuracies of different approaches evaluated
on each subject in the dataset. Average leave-one-out accuracies are reported in
Table [5]. The distributional SVM outperforms the pooling SVM in this setting,
possibly because of the relatively large number of training subjects, i.e., 29 subjects.
Using the invariant features learnt by DICA also gives higher accuracies than other
approaches.

Figure 4. The leave-one-out accuracy of different methods evaluated
on each subject in the GvHD dataset. The top figure depicts
the pooling setting, whereas the bottom figure depicts the
distributional setting.



